Group 1 Final

1 Introduction

Heart disease is one of the most common diseases and a leading cause of death in the United States. This dataset, drawn from a 2020 CDC survey, covers people with and without heart disease. It includes health-related variables such as BMI, smoking status, amount of physical activity, age, and race. Our hope is that by studying how different health variables relate to instances of heart disease, we can determine which factors significantly predict, or are correlated with, heart disease. The dataset also includes a measure of mental health, and we were interested in which factors affect it; for instance, drinking, smoking, and physical activity were predicted to have some impact on overall mental health. Lastly, we want to look at the relationship between BMI and physical health. Some recent studies have suggested that BMI has no correlation with physical health, so we would like to use this dataset to explore that relationship.

1.1 Background

Healthy habits appear in this dataset in various forms, such as diet, average hours of sleep, mental health, physical activity, and so forth. Furthermore, mental health tends to be related to physical health, and this holistic relationship appears to have repercussions for diseases as serious as heart disease. The dataset captures the relationships among these variables, letting us examine the choices of the people surveyed and estimate which habits are associated with degraded mental health, at-risk physical health, and even heart disease.

1.2 Description of the Dataset

This data comes from a 2020 CDC survey on health status, used to study overall health and potential contributors to heart disease. The original dataset had 279 variables and over 400,000 rows, but the version uploaded to Kaggle contains 18 variables which could potentially influence heart disease and just over 320,000 complete rows, so there are no NAs, and all 18 variables were taken into account in some way. The 18 variables are: HeartDisease, BMI, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, SkinCancer. Most of these are straightforward variables that correspond to their names, but a few require further explanation.

The people interviewed for this survey answered “Yes” for Smoking if they have smoked at least 100 cigarettes in their lifetime and “Yes” to AlcoholDrinking if they are considered heavy drinkers (more than 14 drinks per week for men and 7 for women). PhysicalHealth and MentalHealth are numerical variables giving the number of days out of the past 30 during which their physical or mental health, respectively, could be considered not good, so lower values correspond to fewer days of poor health. Respondents answered “Yes” to DiffWalking if they have any difficulty walking or climbing stairs. Diabetic is a four-level factor variable recording whether they have ever had diabetes, with the responses “No”, “Borderline”, “Yes (during pregnancy)”, or “Yes”. PhysicalActivity records whether the respondents reported any physical activity in the past 30 days outside of their regular job. The remaining variables should be relatively self-explanatory given their names.

2 Understanding the Data

2.1 Dataset Summary

Importing the dataset and original data structure:

## 'data.frame':    319795 obs. of  18 variables:
##  $ HeartDisease    : chr  "No" "No" "No" "No" ...
##  $ BMI             : num  16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : chr  "Yes" "No" "Yes" "No" ...
##  $ AlcoholDrinking : chr  "No" "No" "No" "No" ...
##  $ Stroke          : chr  "No" "Yes" "No" "No" ...
##  $ PhysicalHealth  : num  3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : num  30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : chr  "No" "No" "No" "No" ...
##  $ Sex             : chr  "Female" "Female" "Male" "Female" ...
##  $ AgeCategory     : chr  "55-59" "80 or older" "65-69" "75-79" ...
##  $ Race            : chr  "White" "White" "White" "White" ...
##  $ Diabetic        : chr  "Yes" "No" "Yes" "No" ...
##  $ PhysicalActivity: chr  "Yes" "Yes" "Yes" "No" ...
##  $ GenHealth       : chr  "Very good" "Very good" "Fair" "Good" ...
##  $ SleepTime       : num  5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : chr  "Yes" "No" "Yes" "No" ...
##  $ KidneyDisease   : chr  "No" "No" "No" "No" ...
##  $ SkinCancer      : chr  "Yes" "No" "No" "Yes" ...
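
The import step can be sketched as follows. A tiny inline sample stands in for the real file so the example is self-contained; the Kaggle filename in the comment is an assumption:

```r
# Two-row inline sample standing in for the Kaggle CSV.
sample_csv <- textConnection("HeartDisease,BMI,Smoking
No,16.6,Yes
No,20.3,No")
heartdata <- read.csv(sample_csv, stringsAsFactors = FALSE)
str(heartdata)  # chr / num columns, as in the structure shown above
# For the full dataset (assumed filename):
# heartdata <- read.csv("heart_2020_cleaned.csv", stringsAsFactors = FALSE)
```

With `stringsAsFactors = FALSE` (the default since R 4.0), every text column is read as character, which matches the `chr` columns in the structure above.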

2.2 Cleaning the Dataset

*All but five of the variables were set to factors. Most factor variables have 2 levels (yes-or-no questions), but some have up to six levels.

*In the variable Race, “American Indian/Alaskan Native” was redefined to “Native” in order to conserve space on plots and tables, but it should be noted that both of those groups make up that level.

*In the variable Diabetic, “No, borderline diabetes” was redefined to “Borderline” in order to conserve space. There is no information lost in doing this.

*An order for the factor variables Race, Diabetic, and GenHealth was established to keep the orders uniform across plots and tables. The order for Race is based on relative frequency (with “Other” at the end), and the other two variables were put in a logical order.

*The variable AgeCategory was replaced with Age so that it could be used as a numerical variable. For each respondent, a random value was chosen within the range given by AgeCategory and assigned to Age, and the now-unnecessary AgeCategory column was deleted.
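
The cleaning steps above can be sketched on a toy data frame (the column names come from the dataset; the exact original code was not shown, and capping “80 or older” at 99 is an assumption):

```r
set.seed(1)
toy <- data.frame(
  Race        = c("American Indian/Alaskan Native", "White", "Asian"),
  Diabetic    = c("No, borderline diabetes", "Yes", "No"),
  AgeCategory = c("55-59", "80 or older", "65-69"),
  stringsAsFactors = TRUE
)
# Shorten long level names to conserve space on plots and tables
levels(toy$Race)[levels(toy$Race) == "American Indian/Alaskan Native"] <- "Native"
levels(toy$Diabetic)[levels(toy$Diabetic) == "No, borderline diabetes"] <- "Borderline"
# Replace AgeCategory with a random numeric Age inside each category's range
range_of <- function(a) {
  if (a == "80 or older") c(80, 99) else as.numeric(strsplit(a, "-")[[1]])
}
toy$Age <- sapply(as.character(toy$AgeCategory), function(a) {
  r <- range_of(a)
  sample(seq(r[1], r[2]), 1)   # uniform draw within the category's range
})
toy$AgeCategory <- NULL        # drop the now-unnecessary column
```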

## 'data.frame':    319795 obs. of  18 variables:
##  $ HeartDisease    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ BMI             : num  16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 2 1 1 ...
##  $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stroke          : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ PhysicalHealth  : num  3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : num  30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 1 2 1 2 ...
##  $ Sex             : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 1 1 1 2 ...
##  $ Race            : Factor w/ 6 levels "White","Hispanic",..: 1 1 1 1 1 3 1 1 1 1 ...
##  $ Diabetic        : Factor w/ 4 levels "No","Borderline",..: 4 1 4 1 1 1 1 4 2 1 ...
##  $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 2 ...
##  $ GenHealth       : Factor w/ 5 levels "Poor","Fair",..: 4 4 2 3 4 2 2 3 2 3 ...
##  $ SleepTime       : num  5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 1 1 ...
##  $ KidneyDisease   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
##  $ SkinCancer      : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 1 1 ...
##  $ Age             : num  59 81 66 76 41 75 71 84 83 65 ...

3 Exploratory Data Analysis

3.1 Understanding the Data

These pie charts give the relative frequency of a few key factor variables. They show that a majority of people do not have heart disease, have not smoked, are not heavy drinkers, are white, and are female.

These histograms give a brief look at the numerical variables. BMI has an average value of 28.3 with a right skew. A majority of people reported 0 days of poor physical and mental health over the past 30 days. The averages for sleep time and age are 7.1 hours and 54.6 years, respectively.

3.2 Smart Question: What variables affect instances of heart disease?

Instances of Heart Disease:
No    292422
Yes    27373

In each table below, rows give heart disease status (No/Yes) and columns give the comparison variable.

Heart Disease vs. Smoking:
          No      Yes
No    176551   115871
Yes    11336    16037

Heart Disease vs. Drinking:
          No      Yes
No    271786    20636
Yes    26232     1141

Heart Disease vs. Diabetes:
          No   Borderline   Yes (during pregnancy)     Yes
No    252134         5992                     2451   31845
Yes    17519          789                      108    8957

Heart Disease vs. Stroke:
          No      Yes
No    284742     7680
Yes    22984     4389

Heart Disease vs. Difficulty Walking:
          No      Yes
No    258040    34382
Yes    17345    10028

Heart Disease vs. Physical Activity:
          No      Yes
No     61954   230468
Yes     9884    17489

Heart Disease vs. Asthma:
          No      Yes
No    254483    37939
Yes    22440     4933

Heart Disease vs. Kidney Disease:
          No      Yes
No    284098     8324
Yes    23918     3455

Heart Disease vs. Skin Cancer:
          No      Yes
No    267583    24839
Yes    22393     4980

These are various tables which compare factor variables with instances of heart disease. To summarize briefly, there were more instances of heart disease with people who had smoked, were heavy drinkers, had a stroke, had difficulty walking, were not physically active, had asthma, had kidney disease, had diabetes, and had skin cancer.
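
Cross-tabulations like the ones above are produced with `table()`; a self-contained sketch with toy vectors (not the survey data):

```r
# Toy factors standing in for HeartDisease and Smoking
hd  <- factor(c("No", "No", "No", "Yes", "Yes"))
smk <- factor(c("No", "Yes", "No", "Yes", "Yes"))
tab <- table(HeartDisease = hd, Smoking = smk)
tab
# Row-wise proportions make comparison easier than raw counts,
# since far more respondents are in the "No heart disease" group:
round(prop.table(tab, margin = 1), 2)
```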

This boxplot looks at the distribution of BMI values for people with and without heart disease. The median BMI was 27.3 for people without heart disease and 28.3 for people with heart disease.

These two plots look at instances of heart disease compared to both age and sex. The median age of people with heart disease was 70 years versus 55 years for people without. The area plot demonstrates an increase in the percentage of people with heart disease as their age increases and also shows that men are more likely to have heart disease than women.

These two bar graphs and split violin plot look at general health for people with and without heart disease. The most common response was “Good” health for people with heart disease and “Very good” for people without heart disease. The violin plot confirms the relationship between heart disease and age while showing the distribution of people in each health category versus age in each case.

This split violin plot and accompanying bar graph look at race and age versus heart disease. The violin plot shows the unexpected result that non-white people report heart disease at younger ages than white people. The bar graph shows that white people have the highest rate of heart disease (9.2%) and Asian people the lowest (3.3%), with the other races falling between those two values.

3.3 Smart Question: What variables affect mental health?

These four boxplots look at which variables affect mental health. It should be noted that since over 60% of respondents reported 0 days of poor mental health, those instances were omitted, so these plots cover people who reported at least 1 day of poor mental health. They show that people have fewer poor mental health days when they don’t smoke, aren’t heavy drinkers, are physically active, and get the recommended amount of sleep (7 to 9 hours per night).

3.4 Smart Question: Does BMI have any effect on physical health?

This boxplot compares general health categories to recorded BMI. Those who reported “Excellent” health had the lowest median BMI (25.4) and those who reported “Fair” health had the highest median BMI (29.4).

4 Testing

4.1 Chi-Square Test

The Chi-square test of independence is a statistical hypothesis test used to determine whether two categorical variables are likely to be related. We conduct a couple of chi-square tests to check whether our variables are independent.

4.1.2 Does the data support that race affects heart disease?

We want to check if the Heart Disease variable is related to the Race variable. We conduct another chi-square test to check whether Heart Disease and Race are independent. H0: Heart Disease and race are independent from each other. H1: Heart Disease and race are not independent from each other.

##                                 HeartDisease
## Race                                 No    Yes
##   American Indian/Alaskan Native   4660    542
##   Asian                            7802    266
##   Black                           21210   1729
##   Hispanic                        26003   1443
##   Other                           10042    886
##   White                          222705  22507
## 
##  Pearson's Chi-squared test
## 
## data:  racetable
## X-squared = 844, df = 5, p-value <0.0000000000000002

The table shows how many people suffer from heart disease, broken down by race. According to the table, white respondents are more likely to suffer from heart disease than respondents of other races. The chi-square test assessed whether the variables are related, and with a p-value below 2.2e-16 we reject the null hypothesis, supporting the conclusion that race and heart disease are not independent.
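
The test itself is a direct application of `chisq.test()` to a contingency table; a minimal self-contained sketch (toy counts, not the survey data):

```r
# Toy contingency table: rows = race, columns = heart disease status
racetable <- as.table(rbind(
  White = c(No = 200, Yes = 30),
  Asian = c(No = 100, Yes = 2)
))
res <- chisq.test(racetable)
res$statistic  # X-squared
res$p.value    # p-value for the test of independence
```

A p-value below the chosen significance level (here 0.05) leads to rejecting the null hypothesis of independence.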

4.1.3 Does the data support that Heart Disease has an effect on Gen Health?

The last chi-square test conducted checks whether the variable heart disease is independent of general health. H0: Heart Disease and Gen Health are independent from each other. H1: Heart Disease and Gen Health are not independent from each other.
Contingency table for Heart Disease vs. Gen Health:

      Excellent    Fair    Good   Poor   Very good
No        65342   27593   83571   7439      108477
Yes        1500    7084    9558   3850        5381
## 
##  Pearson's Chi-squared test
## 
## data:  gentable
## X-squared = 21542, df = 4, p-value <0.0000000000000002

The contingency table shows how respondents with and without heart disease are distributed across the five general health categories. The chi-square test checks whether the variables are related, and with a p-value below 2.2e-16 we reject the null hypothesis, supporting the conclusion that heart disease and general health are associated.

4.2 T-test

4.2.1 What is the average age of people suffering from heart disease?

In our dataset, HeartDisease is a factor variable and Age is a numeric variable. Therefore, to find the average age of people with heart disease we chose a one-sample t-test, which compares the mean of the sample data to a known value and reports that mean with a confidence interval. By conducting the t-test on the subset of respondents with heart disease, we obtain the average age of people suffering from heart disease.

## 
##  One Sample t-test
## 
## data:  heart_disease_on$Age
## t = 929, df = 27372, p-value <0.0000000000000002
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  68.0 68.3
## sample estimates:
## mean of x 
##      68.2

After subsetting the dataset to the rows where the HeartDisease factor variable is Yes, we conducted the t-test to find the average age. The result shows that the average age of people with heart disease is about 68 years (95% CI: 68.0 to 68.3).
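
The subsetting and one-sample t-test can be sketched with toy values (not the survey data):

```r
toy <- data.frame(
  HeartDisease = c("Yes", "Yes", "No", "Yes"),
  Age          = c(70, 66, 40, 68)
)
# Keep only the rows with heart disease, then run the one-sample t-test
heart_disease_on <- subset(toy, HeartDisease == "Yes")
res <- t.test(heart_disease_on$Age)   # H0: true mean age is 0
unname(res$estimate)                  # sample mean of the "Yes" ages -> 68
```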

4.3 Test For Association

The correlation test is used to evaluate the association between two variables. Pearson’s correlation coefficient ranges from −1 to 1: a value of −1 indicates a perfect negative linear relationship between the variables, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship.

4.3.1 Do alcohol drinking and smoking affect mental and physical health?

To measure the effect of smoking and drinking on mental and physical health, Pearson’s correlation test (cor.test) is used.

## 
##  Pearson's product-moment correlation
## 
## data:  heartdata$MentalHealth and as.numeric(heartdata$Smoking)
## t = 48, df = 319793, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0817 0.0886
## sample estimates:
##    cor 
## 0.0852
## 
##  Pearson's product-moment correlation
## 
## data:  heartdata$MentalHealth and as.numeric(heartdata$AlcoholDrinking)
## t = 29, df = 319793, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0478 0.0547
## sample estimates:
##    cor 
## 0.0513

Neither smoking nor drinking is strongly correlated with mental health: the correlation for smoking is 0.085 and for alcohol drinking 0.051. Of the two, smoking has the stronger correlation with mental health.

## 
##  Pearson's product-moment correlation
## 
## data:  heartdata$PhysicalHealth and as.numeric(heartdata$Smoking)
## t = 66, df = 319793, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.112 0.119
## sample estimates:
##   cor 
## 0.115
## 
##  Pearson's product-moment correlation
## 
## data:  heartdata$PhysicalHealth and as.numeric(heartdata$AlcoholDrinking)
## t = -10, df = 319793, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0207 -0.0138
## sample estimates:
##     cor 
## -0.0173

Neither smoking nor drinking is strongly correlated with physical health: the correlation for smoking is 0.115 and for alcohol drinking −0.017. Smoking has the stronger correlation with physical health, while heavy drinking is weakly negatively correlated.
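
Each of the tests above is `cor.test()` with the two-level factor coerced to numeric (1 = No, 2 = Yes); a toy sketch:

```r
mental  <- c(0, 30, 5, 0, 10, 2)   # days of poor mental health (toy values)
smoking <- factor(c("No", "Yes", "Yes", "No", "Yes", "No"))
# as.numeric() maps the factor levels to 1 ("No") and 2 ("Yes")
res <- cor.test(mental, as.numeric(smoking), method = "pearson")
unname(res$estimate)  # Pearson's r, between -1 and 1
```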

5 Model building

5.1 SMART Question

What variables affect instances of heart disease?

Our goal is to find out the people who are likely to have heart disease in the future, so we can take some actions like a more detailed physical examination before the conditions become worse.

5.2 Pre-processing and balancing the data

The first step is to perform some pre-processing work.

First, because we will use bestglm::bestglm(), a feature-selection method, to decide which variables are essential, the dataset must have the target variable renamed y and placed in the last column, with all unused variables removed. Thus, we moved the HeartDisease column to the end of the dataset and renamed it y.

Second, considering that there are few rows with the value “Yes (during pregnancy)” in the Diabetic variable, we combine the value “Yes (during pregnancy)” and “Yes” together in the Diabetic variable.

These two pre-processing steps are carried out in code at the start of the model-building part.
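
The two steps can be sketched on a toy data frame (a hedged reconstruction, since the original code chunk is not echoed):

```r
d <- data.frame(
  HeartDisease = factor(c("No", "Yes", "No")),
  Diabetic     = factor(c("Yes (during pregnancy)", "No", "Yes")),
  BMI          = c(24.1, 31.0, 27.5)
)
# Step 1: rename the target to y and move it to the last column
# (the layout bestglm::bestglm() expects)
names(d)[names(d) == "HeartDisease"] <- "y"
d <- d[, c(setdiff(names(d), "y"), "y")]
# Step 2: merge the rare "Yes (during pregnancy)" level into "Yes"
levels(d$Diabetic)[levels(d$Diabetic) == "Yes (during pregnancy)"] <- "Yes"
```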

After preprocessing, we need to balance the data. Let us look at the proportion of heart disease data before we continue our research.

## 
##     No    Yes 
## 292422  27373

We can see that the dataset is very unbalanced: only 8.6% of the rows have heart disease (y = Yes). Considering that the dataset is large, we use undersampling to balance it; after balancing, the two classes of y are equal in size. We used the following reference for different balancing methods: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/
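
Undersampling keeps every minority-class row and draws an equally sized random sample from the majority class. A base-R toy sketch of the idea (the report may have used a package routine from the linked reference):

```r
set.seed(42)
d <- data.frame(x = rnorm(110),
                y = factor(c(rep("No", 100), rep("Yes", 10))))
yes_rows <- which(d$y == "Yes")                           # keep all minority rows
no_rows  <- sample(which(d$y == "No"), length(yes_rows))  # random majority subset
balanced <- d[c(no_rows, yes_rows), ]
table(balanced$y)   # classes are now the same size
```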

We can see that the data is balanced now.

## 
##    No   Yes 
## 27373 27373

Before we begin the logistic regression model, let us look at the structure of the dataset now.

## 'data.frame':    54746 obs. of  18 variables:
##  $ BMI             : num  30.9 21.9 25.8 25.8 30.1 ...
##  $ Smoking         : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
##  $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ Stroke          : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PhysicalHealth  : num  0 0 1 5 1 0 0 0 3 0 ...
##  $ MentalHealth    : num  0 0 0 0 0 0 0 3 10 0 ...
##  $ DiffWalking     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex             : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 1 2 1 1 1 ...
##  $ Race            : Factor w/ 6 levels "American Indian/Alaskan Native",..: 6 6 6 4 6 6 6 6 6 6 ...
##  $ Diabetic        : Factor w/ 3 levels "No","No, borderline diabetes",..: 1 1 1 1 1 1 1 1 3 1 ...
##  $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ GenHealth       : num  4 4 3 4 3 3 2 3 2 3 ...
##  $ SleepTime       : num  8 6 8 6 6 8 5 8 6 8 ...
##  $ Asthma          : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ KidneyDisease   : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ SkinCancer      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ Age             : num  58 28 82 34 59 66 27 31 61 60 ...
##  $ y               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

5.3 Logistic regression model

We split the dataset into two parts to train and evaluate the model. 80% of the dataset will be used to train the model, and the remaining 20% will be used to test the model’s accuracy. We use createDataPartition from the caret library to split the dataset.
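
caret::createDataPartition draws a stratified random sample; a simpler base-R sketch of the same 80/20 split (without stratification, to avoid the package dependency) on toy data:

```r
set.seed(1)
d <- data.frame(x = rnorm(100), y = factor(rep(c("No", "Yes"), 50)))
# Draw 80% of the row indices at random for training
train_idx  <- sample(seq_len(nrow(d)), size = 0.8 * nrow(d))
data_train <- d[train_idx, ]
data_test  <- d[-train_idx, ]   # everything not sampled becomes the test set
c(train = nrow(data_train), test = nrow(data_test))
```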

After having the data split, the training dataset is used to build the model. First, we use all the variables as independent variables and make a model as below.

## 
## Call:
## glm(formula = y ~ ., family = binomial(logit), data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9445  -0.7856  -0.0188   0.8142   2.9983  
## 
## Coefficients:
##                                  Estimate Std. Error z value
## (Intercept)                     -3.383373   0.144227  -23.46
## BMI                              0.011842   0.001979    5.98
## SmokingYes                       0.390439   0.024282   16.08
## AlcoholDrinkingYes              -0.174367   0.053397   -3.27
## StrokeYes                        1.191791   0.050745   23.49
## PhysicalHealth                   0.003855   0.001549    2.49
## MentalHealth                     0.005636   0.001583    3.56
## DiffWalkingYes                   0.212683   0.033270    6.39
## SexMale                          0.737279   0.024622   29.94
## RaceAsian                       -0.481915   0.133290   -3.62
## RaceBlack                       -0.372107   0.101400   -3.67
## RaceHispanic                    -0.210721   0.102362   -2.06
## RaceOther                       -0.176508   0.112251   -1.57
## RaceWhite                       -0.195594   0.091659   -2.13
## DiabeticNo, borderline diabetes  0.193157   0.074438    2.59
## DiabeticYes                      0.484186   0.030476   15.89
## PhysicalActivityYes             -0.033862   0.028341   -1.19
## GenHealth                       -0.502894   0.014087  -35.70
## SleepTime                       -0.030457   0.007679   -3.97
## AsthmaYes                        0.299073   0.034472    8.68
## KidneyDiseaseYes                 0.663414   0.052078   12.74
## SkinCancerYes                    0.145502   0.035878    4.06
## Age                              0.058646   0.000947   61.91
##                                             Pr(>|z|)    
## (Intercept)                     < 0.0000000000000002 ***
## BMI                                    0.00000000220 ***
## SmokingYes                      < 0.0000000000000002 ***
## AlcoholDrinkingYes                           0.00109 ** 
## StrokeYes                       < 0.0000000000000002 ***
## PhysicalHealth                               0.01279 *  
## MentalHealth                                 0.00037 ***
## DiffWalkingYes                         0.00000000016 ***
## SexMale                         < 0.0000000000000002 ***
## RaceAsian                                    0.00030 ***
## RaceBlack                                    0.00024 ***
## RaceHispanic                                 0.03953 *  
## RaceOther                                    0.11585    
## RaceWhite                                    0.03285 *  
## DiabeticNo, borderline diabetes              0.00946 ** 
## DiabeticYes                     < 0.0000000000000002 ***
## PhysicalActivityYes                          0.23216    
## GenHealth                       < 0.0000000000000002 ***
## SleepTime                              0.00007306129 ***
## AsthmaYes                       < 0.0000000000000002 ***
## KidneyDiseaseYes                < 0.0000000000000002 ***
## SkinCancerYes                          0.00005003125 ***
## Age                             < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 60714  on 43795  degrees of freedom
## Residual deviance: 43297  on 43773  degrees of freedom
## AIC: 43343
## 
## Number of Fisher Scoring iterations: 5

We can see from the model that some of the p-values for Race, and the p-value for PhysicalActivity, are greater than 0.05, which suggests these two variables are not significant. So we drop these two variables and fit a second logistic regression model.

We will quickly check two things for this model. First, the p-values: a p-value below 0.05 indicates significance, which means the corresponding coefficient estimates are reliable. Second, the pseudo R-squared: this value, ranging from 0 to 1, indicates how much of the variance our model explains.

We can see that all the p-values of the model indicate significance, suggesting our model is a legitimate one. A pseudo R-squared of 0.29 tells us that about 29 percent of the variance is explained.
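
McFadden's pseudo R-squared can be recovered directly from the deviances reported in a glm summary, using the full-model values shown earlier as a check:

```r
# Deviances from the glm summary of the full model
null_dev  <- 60714   # null deviance
resid_dev <- 43297   # residual deviance
pseudo_r2 <- 1 - resid_dev / null_dev   # McFadden's pseudo R^2
round(pseudo_r2, 2)   # about 0.29
```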

After we finish this, we can have a look at the Variance Inflation Factor (vif).

  • When 1 < vif < 5, it means the variables are mildly correlated. It’s acceptable.
  • When 5 < vif < 10, it means moderately correlated, and it also can be acceptable.
  • When vif > 10, it’s not acceptable.
##                             BMI                      SmokingYes 
##                            7.12                            6.38 
##              AlcoholDrinkingYes                       StrokeYes 
##                            6.52                            9.62 
##                  PhysicalHealth                    MentalHealth 
##                           10.43                            8.06 
##                  DiffWalkingYes                         SexMale 
##                            8.71                            6.58 
## DiabeticNo, borderline diabetes                     DiabeticYes 
##                            5.59                            7.09 
##                       GenHealth                       SleepTime 
##                           11.09                            6.68 
##                       AsthmaYes                KidneyDiseaseYes 
##                            6.80                            8.44 
##                   SkinCancerYes                             Age 
##                            6.40                           11.15

We can see that some vif values are larger than 10, which means those variables are highly correlated with the others and not acceptable. So we dropped one variable at a time, checking the p-values after each drop to ensure the remaining variables stayed significant. In the end, we got the model below:

## 
## Call:
## glm(formula = y ~ . - Race - PhysicalActivity - Age - Asthma - 
##     PhysicalHealth, family = binomial(logit), data = data_train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.370  -0.858  -0.120   0.941   2.297  
## 
## Coefficients:
##                                 Estimate Std. Error z value
## (Intercept)                      0.21961    0.08471    2.59
## BMI                             -0.00625    0.00181   -3.45
## SmokingYes                       0.48167    0.02257   21.34
## AlcoholDrinkingYes              -0.40284    0.04980   -8.09
## StrokeYes                        1.36603    0.04933   27.69
## MentalHealth                    -0.01653    0.00141  -11.74
## DiffWalkingYes                   0.62529    0.03037   20.59
## SexMale                          0.59018    0.02277   25.91
## DiabeticNo, borderline diabetes  0.42327    0.07167    5.91
## DiabeticYes                      0.71741    0.02907   24.68
## GenHealth                       -0.57790    0.01223  -47.24
## SleepTime                        0.03334    0.00728    4.58
## KidneyDiseaseYes                 0.85136    0.05099   16.70
## SkinCancerYes                    0.74038    0.03395   21.81
##                                             Pr(>|z|)    
## (Intercept)                                  0.00953 ** 
## BMI                                          0.00055 ***
## SmokingYes                      < 0.0000000000000002 ***
## AlcoholDrinkingYes                0.0000000000000006 ***
## StrokeYes                       < 0.0000000000000002 ***
## MentalHealth                    < 0.0000000000000002 ***
## DiffWalkingYes                  < 0.0000000000000002 ***
## SexMale                         < 0.0000000000000002 ***
## DiabeticNo, borderline diabetes   0.0000000035125791 ***
## DiabeticYes                     < 0.0000000000000002 ***
## GenHealth                       < 0.0000000000000002 ***
## SleepTime                         0.0000047081971992 ***
## KidneyDiseaseYes                < 0.0000000000000002 ***
## SkinCancerYes                   < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 60714  on 43795  degrees of freedom
## Residual deviance: 48335  on 43782  degrees of freedom
## AIC: 48363
## 
## Number of Fisher Scoring iterations: 4

We also checked the vif and can see that all the values are now below 10.

##                             BMI                      SmokingYes 
##                            6.03                            5.58 
##              AlcoholDrinkingYes                       StrokeYes 
##                            5.68                            9.12 
##                    MentalHealth                  DiffWalkingYes 
##                            6.39                            7.44 
##                         SexMale DiabeticNo, borderline diabetes 
##                            5.66                            5.21 
##                     DiabeticYes                       GenHealth 
##                            6.51                            8.53 
##                       SleepTime                KidneyDiseaseYes 
##                            6.02                            8.09 
##                   SkinCancerYes 
##                            5.81
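
A predictor's VIF is 1/(1 − R²) from regressing that predictor on all the others, so a high VIF means the predictor is largely explained by the rest (the report's values likely come from a package such as car::vif). A self-contained toy illustration of the definition:

```r
set.seed(7)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.5)   # built to be correlated with x1
x3 <- rnorm(200)                  # independent noise
# Regress x1 on the other predictors, then apply the VIF formula
r2     <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1  # well above 1, because x2 duplicates much of x1's information
```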

5.4 Feature selection

In this part, we want to use feature selection to find the most suitable variables for our current model. Unfortunately, the training dataset has more than 40,000 rows, which is large and takes a long time to run. So we used the test dataset for the feature selection instead; it has the same data structure but fewer rows.

Although it lacks an intuitive visual presentation of results, bestglm::bestglm() can handle logistic regression, so we used it for the feature selection.

## Fitting algorithm:  AIC-glm
## Best Model:
##               df deviance
## Null Model 10936    12036
## Full Model 10949    15180
## 
##  likelihood-ratio test - GLM
## 
## data:  H0: Null Model vs. H1: Best Fit AIC-glm
## X = 3144, df = 13, p-value <0.0000000000000002
##     BMI Smoking AlcoholDrinking Stroke MentalHealth DiffWalking  Sex Diabetic
## 1  TRUE    TRUE            TRUE   TRUE         TRUE        TRUE TRUE     TRUE
## 2  TRUE    TRUE            TRUE   TRUE         TRUE        TRUE TRUE     TRUE
## 3  TRUE    TRUE           FALSE   TRUE         TRUE        TRUE TRUE     TRUE
## 4  TRUE    TRUE           FALSE   TRUE         TRUE        TRUE TRUE     TRUE
## 5 FALSE    TRUE            TRUE   TRUE         TRUE        TRUE TRUE     TRUE
##   GenHealth SleepTime KidneyDisease SkinCancer Criterion
## 1      TRUE      TRUE          TRUE       TRUE     12062
## 2      TRUE     FALSE          TRUE       TRUE     12064
## 3      TRUE      TRUE          TRUE       TRUE     12067
## 4      TRUE     FALSE          TRUE       TRUE     12070
## 5      TRUE      TRUE          TRUE       TRUE     12070
##     BMI          Smoking        AlcoholDrinking  Stroke        MentalHealth  
##  Mode :logical   Mode:logical   Mode :logical   Mode:logical   Mode:logical  
##  FALSE:1         TRUE:5         FALSE:2         TRUE:5         TRUE:5        
##  TRUE :4                        TRUE :3                                      
##                                                                              
##                                                                              
##                                                                              
##  DiffWalking      Sex          Diabetic       GenHealth      SleepTime      
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode :logical  
##  TRUE:5         TRUE:5         TRUE:5         TRUE:5         FALSE:2        
##                                                              TRUE :3        
##                                                                             
##                                                                             
##                                                                             
##  KidneyDisease  SkinCancer       Criterion    
##  Mode:logical   Mode:logical   Min.   :12062  
##  TRUE:5         TRUE:5         1st Qu.:12064  
##                                Median :12067  
##                                Mean   :12067  
##                                3rd Qu.:12070  
##                                Max.   :12070

The feature selection shows that the best model has all 13 of these variables, and its AIC (Akaike Information Criterion) is 12062, the lowest among these models.
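The AIC arithmetic behind that criterion can be checked by hand: for a GLM, the AIC is the deviance plus a penalty of 2 for every estimated parameter. A minimal Python sketch (treating each of the 13 selected variables as a single parameter is a simplification, since factor variables actually contribute several coefficients):

```python
def glm_aic(deviance: float, n_params: int) -> float:
    """AIC for a GLM: the deviance (-2 * log-likelihood, up to an
    additive constant) plus a penalty of 2 per estimated parameter."""
    return deviance + 2 * n_params

# The best model's criterion of 12062 reported above is consistent with
# a deviance of 12036 and 13 estimated terms: 12036 + 2 * 13 = 12062.
best_aic = glm_aic(12036, 13)
```

Among candidate models, the one with the lowest AIC is preferred, which is exactly the comparison the Criterion column above encodes.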

5.5 Model Evaluation

In this part, we will use the AUC and a confusion matrix to evaluate the model.

5.5.1 ROC and AUC

The receiver operating characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), and the area under the curve (AUC) summarizes it in a single number. For a useful classifier the AUC lies between 0.5 (no better than random) and 1 (perfect); values above 0.8 are generally considered a good model fit.
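For intuition, the AUC equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. A self-contained Python sketch of that rank-based computation (the labels and scores here are toy values, not our data):

```python
def roc_auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen positive is scored above a randomly chosen
    negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: positives are mostly, but not always, ranked above negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

A perfectly separating classifier would score every positive above every negative and yield an AUC of exactly 1.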

The AUC of the model is 0.795, slightly below 0.8. Because the model specification looks suitable and includes the selected features, we suspect the data itself accounts for the somewhat lower AUC value.

5.5.2 Confusion matrix

We can then take a look at the confusion matrix.

Confusion matrix from the logit model:

             Predicted No   Predicted Yes    Total
Actual No           16646            5252    21898
Actual Yes           6957           14941    21898
Total               23603           20193    43796

From the confusion matrix, the precision is 14941/(5252 + 14941) = 0.74, meaning that 74% of the predicted heart disease cases are correct. The recall is 14941/(6957 + 14941) = 0.68, meaning that the model captures 68% of the actual heart disease cases.

In our model we actually consider recall the more important metric, because a false negative is a heart disease patient missed by the model, which can have harmful consequences.
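The precision and recall arithmetic can be reproduced directly from the confusion matrix counts; a small Python sketch using the training-set numbers above:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP): how many predicted positives are real.
    Recall (sensitivity) = TP / (TP + FN): how many real positives are found."""
    return tp / (tp + fp), tp / (tp + fn)

# Training-set counts from the confusion matrix above:
# 14941 true positives, 5252 false positives, 6957 false negatives.
prec, rec = precision_recall(tp=14941, fp=5252, fn=6957)
```

Rounded to two decimals these reproduce the 0.74 precision and 0.68 recall quoted above.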

##     No  Yes
## 0 4220 1753
## 1 1255 3722

Then we can use the test dataset to check whether the model generalizes. We used data_test to make predictions and computed the confusion matrix on the test data. The precision is 3722/(1255 + 3722) = 0.75 and the recall is 3722/(1753 + 3722) = 0.68.

The precision and recall on the test dataset are similar to those on the training dataset, which suggests our model is reliable for predicting heart disease.

5.6 Interpretation and Reporting

We’ll return to our logistic regression model for a minute and look at the estimated parameters (coefficients). Since the model’s parameters are reported on the logit (log-odds) scale, we transformed them into odds ratios so that they are easier to interpret. After transforming, we sorted the variables by coefficient value.

## # A tibble: 14 × 3
##    term                            estimate statistic
##    <chr>                              <dbl>     <dbl>
##  1 StrokeYes                          3.92      27.7 
##  2 KidneyDiseaseYes                   2.34      16.7 
##  3 SkinCancerYes                      2.10      21.8 
##  4 DiabeticYes                        2.05      24.7 
##  5 DiffWalkingYes                     1.87      20.6 
##  6 SexMale                            1.80      25.9 
##  7 SmokingYes                         1.62      21.3 
##  8 DiabeticNo, borderline diabetes    1.53       5.91
##  9 (Intercept)                        1.25       2.59
## 10 SleepTime                          1.03       4.58
## 11 BMI                                0.994     -3.45
## 12 MentalHealth                       0.984    -11.7 
## 13 AlcoholDrinkingYes                 0.668     -8.09
## 14 GenHealth                          0.561    -47.2

We can see from the table that other diseases (stroke, kidney disease, diabetes, skin cancer), general health condition, sex, DiffWalking (serious difficulty walking or climbing stairs), and smoking habit all strongly influence the probability of heart disease. It is a little strange that drinking alcohol appears to reduce the probability of heart disease, since drinking is usually considered an unhealthy habit; one possible explanation is that the data also includes people who drink only small amounts, such as the occasional glass of wine.
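The logit-to-odds-ratio transformation itself is just exponentiation; a short Python sketch (the example coefficient is back-derived from the StrokeYes odds ratio in the table, purely for illustration, not taken from our fit):

```python
import math

def to_odds_ratio(logit_coef: float) -> float:
    """A logistic regression coefficient is a log odds ratio;
    exponentiating it gives the multiplicative change in the odds."""
    return math.exp(logit_coef)

# A log-odds coefficient of log(3.92) ~ 1.366 maps back to the
# odds ratio of ~3.92 reported for StrokeYes above.
stroke_or = to_odds_ratio(math.log(3.92))
```

An odds ratio above 1 (e.g. SmokingYes at 1.62) raises the odds of heart disease; one below 1 (e.g. GenHealth at 0.561) lowers them.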

6 Classification Trees

6.1 First Classification Tree

For the purposes of the decision tree, the variable MentalHealth was recoded into the categorical values “Yes” and “No”. Its original values range between 1 and 30, in response to how many days in the previous month the people interviewed felt their mental health was not good. An ifelse condition assigns “Yes” when respondents felt their mental health was not good for 15 or more days and “No” otherwise (ifelse(MentalHealth >= 15, "Yes", "No")).
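The recoding is a one-line threshold; a Python equivalent of the ifelse, using the 15-day cutoff described above:

```python
def high_mh(mental_health_days: int) -> str:
    """Mirror of the ifelse recode: "Yes" when the respondent reported
    15 or more not-good mental health days in the past month."""
    return "Yes" if mental_health_days >= 15 else "No"
```

Note that the cutoff is inclusive, so exactly 15 bad days is flagged as “Yes”.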

This recoding was needed before building the decision tree with the relevant variables. For this first decision tree the following variables were selected: HighMH, Smoking, Sex, AlcoholDrinking, and PhysicalActivity.

After subsetting the data, a training set was created and a model was fit to predict the class of the target variable, in this case Smoking, by learning simple decision rules inferred from the training data.

With the training dataset created, a tree was built with Smoking as the target. The results show 13 nodes. The tree starts at the root node, labeled 1), which covers all observations and carries a default decision of No. There are 107,000 observations with Yes, and these are misclassified if we decide No for every observation; the reported probability of No is 0.58 and of Yes is 0.41. The root node splits into two branches, nodes 2 and 4. Node 2 corresponds to those observations for which AlcoholDrinking equals No; it accounts for 238,097 observations, of which 96,200 are Yes, so the majority (with a proportion of 0.596) are No. Interpreting the remaining nodes, 60% of people drink alcohol; of these, 42% are male and practice physical activity, and 41% reported 15 or more days in the prior month on which their mental health seemed to be affected.

Graphically, the tree looks like the following plot. It also shows that node 5, which covers 51% of people who do not work out at all throughout the month, is 45% female, and 42% of those women felt their mental health was not good for 15+ days in the past 30 days. The following plot is more visually appealing and summarizes the conclusions described above. Additionally, specific observations were selected from the fitted tree to obtain the prediction, and the rule used to make that prediction, for the target variable. The results are the following:

##  Smoking                                                                                      
##     0.34 when AlcoholDrinking is  No & PhysicalActivity is Yes & Sex is Female                
##     0.41 when AlcoholDrinking is  No & PhysicalActivity is Yes & Sex is   Male & HighMH is Yes
##     0.43 when AlcoholDrinking is  No & PhysicalActivity is  No & Sex is Female & HighMH is Yes
##     0.52 when AlcoholDrinking is  No & PhysicalActivity is Yes & Sex is   Male & HighMH is  No
##     0.55 when AlcoholDrinking is  No & PhysicalActivity is  No & Sex is   Male                
##     0.57 when AlcoholDrinking is  No & PhysicalActivity is  No & Sex is Female & HighMH is  No
##     0.62 when AlcoholDrinking is Yes
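Those printed rules amount to a small lookup; a Python transcription of the fitted tree’s predictions (probabilities copied from the output above, with arguments passed as "Yes"/"No" strings and "Male"/"Female" for sex):

```python
def smoking_prob(alcohol: str, activity: str, sex: str, high_mh: str) -> float:
    """Probability of smoking per the fitted tree's printed rules."""
    if alcohol == "Yes":                      # drinkers: single leaf, 0.62
        return 0.62
    if activity == "Yes":                     # non-drinkers with physical activity
        if sex == "Female":
            return 0.34
        return 0.41 if high_mh == "Yes" else 0.52
    if sex == "Male":                         # non-drinkers, no physical activity
        return 0.55
    return 0.43 if high_mh == "Yes" else 0.57
```

Reading it this way makes the ordering of the splits explicit: AlcoholDrinking is checked first, then PhysicalActivity, then Sex, with HighMH only used in some branches.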

6.2 Second Classification Tree

A second classification tree was built to understand how different variables interact with the target variable, Smoking. The variables used to construct the model were: Age, SleepTime, Race, HeartDisease, and PhysicalActivity.

For this classification tree, the variable Age was dichotomized with an ifelse condition to flag people 30+ years old; the new variable was named Age30Plus.

Similarly, the variable SleepTime was dichotomized with an ifelse condition to flag people sleeping 7+ hours; the new variable was named AvSleep.

After subsetting the data, a training set was again created and a model was fit to predict the target variable, Smoking, by learning simple decision rules inferred from the training data.

With the training dataset created, a tree was built with Smoking as the target. The results show 9 nodes, starting with the root node, labeled 1), with a default decision of No, which accounts for 58% of the data. This node splits into those who are 30 or older (22%) and those who are younger than 30 (44%). Following the results, node 6 shows that 57% do not have any heart disease, which leads to node 12, where 40% perform some sort of physical activity, while 49% do not perform any physical activity at all, as seen in node 13. Meanwhile, nodes 26 and 27 correspond to race: node 26 indicates that 38% are Asian, Black, or Hispanic, and node 27 indicates that 54% are American Indian, Alaskan Native, White, or other.

Graphically, the tree looks like the following plot. One surprising conclusion it shows is that, starting from node 3, those younger than 30 who have heart disease have a 59% probability of smoking, before accounting for the other variables.

The following plot is more visually appealing and summarizes the conclusions described above. Additionally, specific observations were selected from the fitted tree to obtain the prediction, and the rule used to make that prediction, for the target variable. The results are the following:

##  Smoking                                                                                                                                 
##     0.22 when Age30Plus is Yes                                                                                                           
##     0.39 when Age30Plus is  No & HeartDisease is  No & PhysicalActivity is  No & Race is                       Asian or Black or Hispanic
##     0.41 when Age30Plus is  No & HeartDisease is  No & PhysicalActivity is Yes                                                           
##     0.54 when Age30Plus is  No & HeartDisease is  No & PhysicalActivity is  No & Race is American Indian/Alaskan Native or Other or White
##     0.59 when Age30Plus is  No & HeartDisease is Yes

7 Conclusion

According to the CDC, heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States (CDC, 2022). After performing the EDA, distributions, tests, regression models, and decision trees, we concluded that smoking, stroke, asthma, difficulty walking, kidney disease, diabetes, and skin cancer are all associated with instances of heart disease. In addition to this analysis, we wanted to explore the effect these factors have on mental and physical health, and we arrived at the following conclusions. First, the cutoff value of 0.15 is suitable for our logistic regression model, and through this model we found that general health condition, sex, other diseases, and smoking habits largely influence the probability of heart disease. Second, from the first classification tree, not practicing any physical activity is linked to people feeling that their mental health was not good (not feeling good for 15 or more days in the past month). Finally, from the second classification tree, people younger than 30 who have heart disease have a 59% probability of smoking. Surprisingly, drinking alcohol does not have a great impact on any of the target variables we were looking at (heart disease, mental health, and physical health), whereas smoking has an impact on all three. Similarly, BMI as a single measure did not have an effect on any of the target variables. Studies have shown that BMI would not be expected to identify cardiovascular health or illness overall (Shmerling, 2020), and these same findings explain that body composition, including percent body fat or amount of muscle mass, can vary by race and ethnic group, so BMI alone is a poor predictor of current health status.

8 References

Centers for Disease Control and Prevention. (2022). Heart Disease Facts. https://www.cdc.gov/heartdisease/facts.htm

Kaggle. (2020). Heart Disease Dataset. https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

Shmerling, R. (2020). How useful is the body mass index (BMI). Harvard Health Publishing. https://www.health.harvard.edu/blog/how-useful-is-the-body-mass-index-bmi-201603309339#:~:text=BMI%2C%20as%20a%20single%20measure,the%20only%20measure%20of%20health!

Walton, A. (2017). The 5 Key Habits For Long-Term Health, According To Science. Forbes. https://www.forbes.com/sites/alicegwalton/2017/07/27/the-5-habits-that-really-define-longterm-health-according-to-science/?sh=6833625f4286